This document outlines the methodology for data cleaning, exploration, and visualization. It is structured to ensure transparency and reproducibility of all analyses.
Briefly describe the methods used in the project, including data sources, cleaning steps, and techniques applied to handle missing or inconsistent data.
Here is an inline equation: \(E = mc^2\).
And here is a displayed equation:
\[\begin{equation} f(x) = \int_{-\infty}^{\infty} e^{-x^2} \, dx \end{equation}\]
Both inline and block math can be rendered seamlessly with LaTeX.
You can also create multiline equations with alignment:
\[\begin{equation} \begin{aligned} a &= b + c \\ d &= e + f \end{aligned} \end{equation}\]
The first step in any data analysis is to load the data, options, and packages. This section installs and loads the necessary packages, sets up the output paths, and defines chunk options for the RMarkdown document.
The following code loads the data from various sources, including CSV files, Excel files, and shapefiles. The here package is used to create file paths that are relative to the project directory, ensuring that the code is portable and reproducible.
# Example of loading data from a CSV file
# raw_data <- read_csv(here("Data", "Raw", "data_file.csv"))
# raw_data <- read_excel(here("Data", "Raw", "data_file.xlsx"))
# raw_data <- read_sf(here("Data", "Raw", "data_file.shp"))
The data cleaning process involves several steps to ensure the data is in a suitable format for analysis. This includes handling missing values, correcting data types, and removing duplicates.
# cleaning the data using dplyr and tidyr
# cleaned_data <- raw_data %>%
# clean_names() %>%
# mutate(column_name = as_factor(column_name)) %>%
# mutate(date_column = as.Date(date_column, format = "%Y-%m-%d")) %>%
# mutate(numeric_column = as.numeric(numeric_column)) %>%
# mutate(accross(everything(), ~str_squish(.))) %>% =
# drop_na()
Data exploration is a crucial step in understanding the dataset and identifying patterns or anomalies. This section includes summary statistics, visualizations, and any other relevant analyses to gain insights into the data.
# Explore the cleaned data using basic summaries:
# glimpse(cleaned_data)
# summary(cleaned_data)
# str(cleaned_data)
Data visualization is an essential part of data analysis, allowing for the communication of findings in a clear and effective manner. This section includes various plots and charts to illustrate key insights from the data.
# Example of creating a summary table
# summary_table <- cleaned_data %>%
# group_by(group_var) %>%
# summarise(mean_value = mean(value_var, na.rm = TRUE)) %>%
# ungroup() %>%
# kable() %>%
# kable_styling(full_width = F, position = "left")
# save_kable(summary_table,
# file = here(output_path_tables, "summary_table.html"),
# bootstrap_options = c("striped", "hover", "condensed"),
# full_width = F, position = "left")
# Example of visualization plot
data("diamonds")
# First Plot (p1) - Scatter plot with smoothing line
p1 <- ggplot(
subset(diamonds, carat >= 2.2),
aes(x = table, y = price, colour = cut)
) +
geom_point(alpha = 0.7) +
geom_smooth(method = "loess", alpha = 0.05,
linewidth = 1, span = 1, formula = y ~ x) +
theme_bw() + # White background theme
theme(
plot.title = element_text(hjust = 0.5), # Center the title
legend.position = "top", # Position the legend at the top
text = element_text(size = 10), # Set text size
legend.key.size = unit(0.5, "cm") # Legend key size
)
# Second Plot (p2) - Histogram of depth for specific carat ranges
p2 <- ggplot(
subset(diamonds, carat > 2.2 & depth > 55 & depth < 70),
aes(x = depth, fill = cut)
) +
geom_histogram(colour = "black",
binwidth = 1, position = "dodge") + # Dodged histogram
theme_bw() + # White background theme
theme(
plot.title = element_text(hjust = 0.5), # Center the title
legend.position = "top", # Position the legend at the top
text = element_text(size = 10), # Set text size
legend.key.size = unit(0.5, "cm") # Legend key size
)
p1_npg <- p1 + scale_color_npg()
p2_npg <- p2 + scale_fill_npg()
grid.arrange(p1_npg, p2_npg, ncol = 2)
Example of a scatter plot
# Example of saving a plot
# ggsave(filename = here(output_path_images, "plot_name.png"),
# plot = last_plot(), width = 6, height = 4)
Citing the packages and data used in the analysis is important for reproducibility and transparency. The following code generates a bibliography of all loaded packages. Items can be cited directly within the documentation using the syntax @key where key is the citation key in the first line of the entry, e.g., R Core Team (2024), Wickham et al. (2024), Wickham (2023), Müller (2020). To put citations in parentheses, use [@key] instead.